Unifying Value Iteration, Advantage Learning, and Dynamic Policy Programming
Authors
Abstract
Approximate dynamic programming algorithms, such as approximate value iteration, have been successfully applied to many complex reinforcement learning tasks, and a better approximate dynamic programming algorithm is expected to further extend the applicability of reinforcement learning to various tasks. In this paper we propose a new, robust dynamic programming algorithm that unifies value iteration, advantage learning, and dynamic policy programming. We call it generalized value iteration (GVI), and its approximate version approximate GVI (AGVI). We prove a performance guarantee for AGVI that includes the performance guarantees of the existing algorithms as special cases. We discuss theoretical weaknesses of the existing algorithms and explain the advantages of AGVI. Numerical experiments in a simple environment support the theoretical arguments and suggest that AGVI is a promising alternative to previous algorithms.
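For concreteness, below is a minimal tabular sketch of two of the operators that GVI is said to unify: the Bellman optimality operator behind value iteration, and a Baird-style advantage-learning (gap-increasing) update. This is not the paper's GVI/AGVI operator, and dynamic policy programming (which adds a softmax/KL-regularized flavor) is omitted; the random MDP, the discount factor, the coefficient `alpha`, and the function names are illustrative assumptions.

```python
import numpy as np

# Toy tabular MDP (illustrative only): P[s, a, s'] transition kernel, r[s, a] rewards.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
gamma = 0.9   # discount factor (assumed)
alpha = 0.5   # gap-increasing coefficient (assumed)

def bellman_optimality(Q):
    # Value iteration sweep: (T Q)(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a')
    v = Q.max(axis=1)
    return r + gamma * (P @ v)

def advantage_learning(Q):
    # Gap-increasing update: (T_AL Q)(s,a) = (T Q)(s,a) - alpha * (max_b Q(s,b) - Q(s,a))
    gap = Q.max(axis=1, keepdims=True) - Q
    return bellman_optimality(Q) - alpha * gap

Q_vi = np.zeros((n_states, n_actions))
Q_al = np.zeros((n_states, n_actions))
for _ in range(200):
    Q_vi = bellman_optimality(Q_vi)
    Q_al = advantage_learning(Q_al)

# On this toy MDP both sweeps should recover the same greedy policy, while the
# advantage-learning iterate enlarges the gap between the best and worst actions.
print("greedy policies:", Q_vi.argmax(axis=1), Q_al.argmax(axis=1))
print("action gaps (VI):", Q_vi.max(axis=1) - Q_vi.min(axis=1))
print("action gaps (AL):", Q_al.max(axis=1) - Q_al.min(axis=1))
```

Note that with `alpha = 0` the advantage-learning update reduces to plain value iteration, which illustrates the sense in which a single parameterized operator can subsume both.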
Similar resources
Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not ...
A Tutorial on Linear Function Approximators for Dynamic Programming and Reinforcement Learning
A Markov Decision Process (MDP) is a natural framework for formulating sequential decision-making problems under uncertainty. In recent years, researchers have greatly advanced algorithms for learning and acting in MDPs. This article reviews such algorithms, beginning with well-known dynamic programming methods for solving MDPs such as policy iteration and value iteration, then describes approx...
Regular Policies in Abstract Dynamic Programming
We consider challenging dynamic programming models where the associated Bellman equation, and the value and policy iteration algorithms, commonly exhibit complex and even pathological behavior. Our analysis is based on the new notion of regular policies. These are policies that are well-behaved with respect to value and policy iteration, and are patterned after proper policies, which are central...
Temporal Difference-based Adaptive Policies in Neuro-dynamic Programming
Based on the temporal difference method in neuro-dynamic programming, an adaptive policy for finite-state Markov decision processes with the average reward criterion is constructed under the minorization condition. We estimate the value function by a learning iteration algorithm, and the adaptive policy is specified as an ε-forced modification of the greedy policy for the estimated value and the es...
Solving the Dice Game Pig: An Introduction to Dynamic Programming and Value Iteration
For such a simple dice game, one might expect a simple optimal strategy, such as in Blackjack (e.g., “stand on 17” under certain circumstances, etc.). As we shall see, this simple dice game yields a much more complex and intriguing optimal policy. In our exploration of Pig we will learn about dynamic programming and value iteration, covering fundamental concepts of reinforcement learning techni...
Journal: CoRR
Volume: abs/1710.10866
Pages: -
Publication year: 2017